Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model

Authors

  • Chen Zhang
  • Chunguo Wu
  • Enrico Blanzieri
  • You Zhou
  • Yan Wang
  • Wei Du
  • Yanchun Liang
Abstract

MOTIVATION: Mislabeled samples often appear in gene expression profiles because different disease subtypes resemble one another and because diagnosis is subjective. Mislabeled samples degrade supervised learning procedures. The LOOE-sensitivity algorithm detects mislabeled microarray samples by data perturbation. However, it fails to measure the effect of the perturbation and performs poorly as a result. The purpose of this article is to design a novel method for detecting mislabeled microarray samples that takes advantage of measuring the effect of data perturbations.

RESULTS: To measure the effect of data perturbation, we define an index named the perturbing influence value (PIV), based on the support vector machine (SVM) regression model. Three PIV-based algorithms are proposed to detect mislabeled samples: the Column Algorithm (CAPIV), the Row Algorithm (RAPIV) and the progressive Row Algorithm (PRAPIV). Experimental results on six artificial datasets and five microarray datasets demonstrate that all of the proposed methods are superior to LOOE-sensitivity. Moreover, compared with the simple SVM and CL-stability, the PRAPIV algorithm shows an increase in precision together with high recall.

AVAILABILITY: The program and source code (in Java) are publicly available at http://ccst.jlu.edu.cn/CSBG/PIVS/index.htm

Similar resources

Comparison of time to the event and nonlinear regression models in the analysis of germination data

Extended abstract   Introduction: Numerous studies are being carried out to reveal the effects of different treatments on the germination of seeds from various plants. The most commonly used method of analysis is the nonlinear regression which estimates germination parameters. Although the nonlinear regression has been performed based on different models, some serious problems in its structure...

Regression Modeling for Spherical Data via Non-parametric and Least Square Methods

Introduction Statistical analysis of data on the Earth's surface has been a favorite subject among many researchers. Such data can be related to animals' migration from one region to another. Statistical modeling of their paths then helps biological researchers predict their movements and estimate the areas where the animals are most likely to be present. From a geome...

Prediction of chronological age based on Demirjian dental age using robust ridge regression method

Introduction: Estimation of age has an important role in legal medicine, endocrine diseases and clinical dentistry. Correspondingly, evaluation of dental development stages is more valuable than tooth erosion. In this research, calendar age is modeled using new and rich statistical methods. It can be considered a practicable method in medical science that is...

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased parameter estimates and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

Intelligent and rapid diagnosis of heart disease based on the synergy of linear neural networks and the logistic regression method

Background and purpose: Diseases have been the greatest threat to human beings throughout history. Heart disease (HD) has gained special attention in medical studies. Recently, the classification and diagnosis of HD has become a key topic, and a lot of research has been done to increase precision and reduce error in this type of decision. With development of intelligent learning syst...


Journal title:
  • Bioinformatics

Volume 25, Issue 20

Pages  -

Publication date 2009